Filtering out the lead instrument

Goal

We are going to generate some music with more than one synthesizer.

We will filter out the lead tone using a feed-forward neural network.

Model: input wave with 3 instruments -> output wave with 1 instrument

We will use an auto-encoder-like setup: where an auto-encoder would take an image, we take a short fragment of 1024 samples (~1/40th of a second) of sound data.
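Slicing the tracks into these fixed-size fragments is straightforward; a minimal sketch, assuming `input_track` and `target_track` are 1-D NumPy waveforms (the random arrays below are hypothetical stand-ins for the real audio):

```python
import numpy as np

def to_windows(track, window=1024):
    """Cut a 1-D waveform into non-overlapping fixed-size windows."""
    n = len(track) // window              # drop the incomplete tail
    return track[:n * window].reshape(n, window)

# toy stand-ins for input_track / target_track
input_track = np.random.randn(5000)
target_track = np.random.randn(5000)

x = to_windows(input_track)   # network input: the full mix
y = to_windows(target_track)  # network target: the isolated lead
```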


Training and target

In [8]:
Audio(input_track[0:8*sr], rate=sr)
Out[8]:
In [11]:
Audio(target_track[0:8*sr], rate=sr)
Out[11]:

In [16]:
import keras
from keras.layers import Dense, PReLU
from keras.optimizers import Adam

input_shape = (1024,)   # one window of 1024 samples in
output_shape = 1024     # the same window size out

model = keras.models.Sequential()
model.add(Dense(1024, input_shape=input_shape))
model.add(PReLU())
model.add(Dense(512))
model.add(PReLU())
model.add(Dense(output_shape))
model.compile(optimizer=Adam(), loss='mse')
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1024)              1049600   
_________________________________________________________________
p_re_lu_1 (PReLU)            (None, 1024)              1024      
_________________________________________________________________
dense_2 (Dense)              (None, 512)               524800    
_________________________________________________________________
p_re_lu_2 (PReLU)            (None, 512)               512       
_________________________________________________________________
dense_3 (Dense)              (None, 1024)              525312    
=================================================================
Total params: 2,101,248
Trainable params: 2,101,248
Non-trainable params: 0
_________________________________________________________________

How does the network sound before training?

In [20]:
Audio(model_predict(model, mix)[0:15*sr], rate=sr)
Out[20]:
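The helper `model_predict` is defined earlier in the notebook and not shown here. One plausible implementation, assuming it chops the wave into 1024-sample windows, runs the network on each, and stitches the outputs back together (`_Echo` is a stand-in "model" for the quick check):

```python
import numpy as np

def model_predict(model, track, window=1024):
    """Run the model window by window, then concatenate the
    predicted windows back into one continuous waveform."""
    n = len(track) // window                       # drop incomplete tail
    windows = track[:n * window].reshape(n, window)
    return model.predict(windows).reshape(-1)      # flatten to 1-D wave

# quick check with a stand-in "model" that just echoes its input
class _Echo:
    def predict(self, w):
        return w

wave = np.arange(3000, dtype=float)
out = model_predict(_Echo(), wave)   # only 2 full windows survive
```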

Fit the model for two epochs

In [21]:
model.fit(x, y, epochs=2)
Epoch 1/2
40960/40960 [==============================] - 20s 482us/step - loss: 0.0047
Epoch 2/2
40960/40960 [==============================] - 20s 494us/step - loss: 0.0029
Out[21]:
<keras.callbacks.History at 0x1272d4b70>

Let's test the model

In [25]:
display(Audio(mix[40*sr:45*sr], rate=sr))
display(Audio(model_predict(model, mix)[40*sr:45*sr], rate=sr))

Is the model overfitting? To check, generate a fresh test set with a different scale.

In [29]:
score_tracks_test, audio_tracks_test, mix_test = \
    generate_dataset(n_measures=64,
                     tempo=Tempo(120),
                     scale=GenericScale('E', [0, 1, 4, 5, 7, 8, 10]),
                     sampling_info=sampling_info)
In [28]:
Audio(model_predict(model, mix_test[0:15*44100]), rate=sr)
Out[28]:
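The held-out set really is different music: to see which pitches `GenericScale('E', [0, 1, 4, 5, 7, 8, 10])` produces, map the semitone offsets onto note names (the `NOTES` table below is just an illustration, not part of the notebook's code):

```python
# Map the interval pattern onto a root of E
NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
root = NOTES.index('E')
scale = [NOTES[(root + step) % 12] for step in [0, 1, 4, 5, 7, 8, 10]]
print(scale)  # E Phrygian dominant: E F G# A B C D
```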